1 Statement of the problems studied

This project has investigated novel circuit architectures for low-energy VLSI systems. Its
primary focus has been on charge-recovering (a.k.a. adiabatic) circuits. By steering currents
across devices with low voltage drops and by recycling undissipated energy, these
circuits can operate more efficiently than their conventional digital counterparts. Early
investigations into adiabatic circuits have yielded very complex designs that are impractical
for high-speed design. This project has led to the discovery of extremely simple
charge-recovering circuits that achieve very high energy efficiency at relatively high operating
frequencies. The results of this research have been validated through experimental silicon
prototypes. For three of the inventions that came out of this project, the University of
Michigan has filed patent applications with the United States Patent and Trademark Office.

In addition to charge-recovering circuits, this project has also explored tools and
architectures for conventional low-energy CMOS design. The focus has been on design
automation tools that can provide quick and accurate estimates of power dissipation from
register-transfer level descriptions of programmable digital systems. Low-power signal
processing architectures for communication applications have also been explored.


2  Summary  of  the  most  important  findings 

The  main  contributions  of  this  research  project  were  the  following. 

• Novel source-coupled adiabatic logic family (SCAL). The proposed circuit family
operates with a single-phase power-clock waveform. In Hspice simulations, arithmetic
units designed in SCAL achieve up to 10 times higher energy efficiency than
their conventional counterparts with 0.5 µm process parameters and operating speeds
exceeding 200MHz. An 8-bit multiplier chip designed in SCAL received the First
Prize in the Operational Category of the VLSI Design Contest of the 2001 ACM/IEEE
Design Automation Conference.

• Charge-recovering flip-flop and associated resonant clock generator: This circuitry
enables the recovery of charges from the clock distribution network, providing near-zero
energy consumption when the input data is not switching and yielding the power
savings of clock gating approaches without the additional complexity of implementing
clock gating in the design.

•  Low-power  charge-recovering  static  memories  (SRAMs):  The  proposed  SRAMs  rely 
on  a  novel  driver  to  recover  charges  from  the  bit/word  lines,  thus  reducing  the  power 
consumption  associated  with  driving  substantial  capacitive  loads. 

• Low-power dynamic memories (DRAMs): To reduce the power associated with refreshing
memory cells too often, these DRAMs perform refreshing on a block (rather
than an entire row) basis and in conjunction with multiple refresh periods.

•  A  reconfigurable  pipeline  architecture  for  energy-efficient  multimedia  processing. 
These  pipelines  adapt  their  performance  and  dissipation  to  the  required  data  rates 
(rather than the worst-case ones) by disabling and bypassing a select subset of pipeline
registers. 

•  Low-power  design  methodology  for  high-throughput  large-state  Viterbi  decoders. 

• System-level power estimation methodology for VLSI systems consisting of intellectual
property (IP) components.

The remainder of this section provides more details about each of our main contributions.

2.1  Source-coupled  adiabatic  logic 

Adiabatic  circuit  architectures  reduce  energy  dissipation  by  steering  currents  across  devices 
with  low  voltage  differences  and  by  recycling  the  energy  stored  in  their  capacitors  [1,  6]. 
Broadly  speaking,  the  efficient  operation  of  these  designs  is  yet  another  manifestation  of 
energy-speed trade-offs. Consequently, adiabatic circuits that operate very efficiently at
low  operating  frequencies  stop  functioning  at  high  data  rates.  On  the  other  hand,  adiabatic 
circuits  with  broad  operating  ranges  tend  to  be  dissipative  at  low  frequencies. 
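
The energy-speed trade-off can be made concrete with the standard first-order model of adiabatic charging. This model is not spelled out in the report and is given here only as a sketch. The symbols are assumptions introduced for illustration: C is the load capacitance, V the voltage swing, R the effective resistance of the charging path, and T the duration of the charging ramp.

```latex
E_{\mathrm{conventional}} = \tfrac{1}{2}\, C V^{2},
\qquad
E_{\mathrm{adiabatic}} \approx \frac{RC}{T}\, C V^{2}
\quad \text{for } T \gg RC .
```

Under this model, dissipation falls as the ramp is slowed relative to the RC time constant, which is why a design tuned for very low dissipation at low frequencies gives up headroom at high data rates.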

This project has led to the discovery of SCAL, a source-coupled adiabatic logic family
that achieves very low power dissipation across a wide range of operating frequencies.
High speed is ensured by activating a sense-amplifier structure in a nonadiabatic manner.
High energy efficiency across a broad frequency range is achieved by providing each gate
with a current source that can be tuned by transistor sizing to achieve optimal charging
rates for its operating conditions. In addition to low energy consumption and a broad operating
spectrum, SCAL features include true single-phase operation, balanced clock loading,
functional completeness, and straightforward cascading.


Figure  1:  (a)  A  PMOS  and  (b)  an  NMOS  inverter  in  SCAL. 

SCAL is a partially adiabatic logic family that is clocked by a single-phase power-clock.
Low-energy operation is achieved across a broad operating range by tuning a current
source attached to each gate. The basic structure of a SCAL PMOS gate is shown in the
PMOS inverter of Figure 1(a). This inverter comprises a pair of cross-coupled latches
(MP1 and MP2), a pair of current control switches (MP3 and MP4), two function blocks
(MP5 and MP6), and a current source (MP7). This current source is the main structural
characteristic that differentiates SCAL from TSEL, its closest adiabatic relative. The charge
flow rate through the current source is controlled by the W/L ratio of MP7. The constant
voltage supply Vdd is required for activating the cross-coupled latches. The port PCLK is
used to apply a sinusoidal power-clock φ.

A PMOS SCAL gate operates in two phases: discharge and evaluate. The energy stored
in the output node out or in its complement is recovered during discharge. The new output
of a PMOS gate is computed during evaluate. SCAL cascades are built by stringing together
alternating PMOS and NMOS gates. The only signal required for controlling a SCAL cascade
is the power-clock φ.

In simulations from layout with 0.5 µm CMOS process parameters, 4-bit pipelined
carry-lookahead adders (CLAs) developed in SCAL functioned correctly for operating
frequencies ranging from below 10MHz up to 280MHz. In comparison with 2N-2P and TSEL
[16, 18], adiabatic families that use a similar cross-coupled latch structure and are capable
of high-speed operation, our logic was consistently less dissipative. In comparison with
PAL [22, 28], a very energy-efficient family at low frequencies, our logic was 50% more
dissipative at 10MHz and less dissipative for operating frequencies exceeding 100MHz. In
comparison with their static CMOS counterparts, our designs dissipated about one third as
much energy at 280MHz. At 10MHz, their dissipation was lower than that of CMOS by more than
an order of magnitude.

To demonstrate the robustness, efficiency, and practicality of SCAL, we used a diode-based
variant of its basic topology to design an 8-bit unsigned multiplier chip. The chip
included built-in self-test logic, an integrated resonant clock generator, and circuits for
converting between adiabatic and CMOS signaling conventions. With its 11,854 transistors,
our  design  was  sufficiently  large  and  complex  to  enable  a  thorough  exploration  of  several 
issues  that  are  central  to  adiabatic  chip  design. 

In comparison with a voltage-scaled, pipelined, static CMOS multiplier, the adiabatic
multiplier dissipates less energy. At 200MHz, it is roughly 4 times more energy efficient
than its CMOS counterpart, dissipating only 130pJ per operation with a 2.7V peak supply.
While operating in self-test mode at a clock rate of 100MHz, it dissipates approximately
91pJ per operation with a 2.2V peak supply. These efficiencies were obtained without
relying on any optimization tools and despite our conservative design approach, which was
primarily aimed at obtaining a working chip and thus left substantial energy optimization
potential unexploited.
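
To relate the per-operation energies above to average power, the short calculation below multiplies energy per operation by operation rate. It is only a sketch: it assumes that the pipelined multiplier completes one operation per clock cycle, which the report does not state explicitly, and the function name is illustrative.

```python
def average_power_watts(energy_per_op_joules: float, ops_per_second: float) -> float:
    """Average power, assuming one operation completes every clock cycle."""
    return energy_per_op_joules * ops_per_second


# Energy figures quoted in the text for the adiabatic multiplier.
p_200 = average_power_watts(130e-12, 200e6)  # 130pJ per operation at 200MHz
p_100 = average_power_watts(91e-12, 100e6)   # 91pJ per operation at 100MHz

print(f"{p_200 * 1e3:.1f} mW at 200MHz, {p_100 * 1e3:.1f} mW at 100MHz")
```

Under that assumption the multiplier would draw about 26 mW at 200MHz and about 9.1 mW in self-test mode at 100MHz.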


Figure  2:  Microphotograph  of  test  chip. 

The multiplier chip was awarded First Prize in the Operational Category of the VLSI
Design Contest of the 2001 ACM/IEEE Design Automation Conference. It was fabricated
in a standard 3-metal, 1-poly, 0.5 µm CMOS process through MOSIS. A microphotograph
of the chip is shown in Figure 2. Correct chip operation was experimentally validated at
frequencies up to 130MHz, limited by the bandwidth of the off-chip interface. For operating
speeds up to 130MHz, power dissipation measurements correlate well with simulation
results under identical operating conditions. The correct operation of the integrated clock
generator at frequencies over 140MHz has also been experimentally validated, although
a similar measurement bandwidth problem prevented accurate combined power measurements
at these frequencies.

2.2  Charge-recovering  flip-flop  and  clock  generator 

A  popular  approach  to  low-energy,  high-throughput  VLSI  system  design  is  voltage-scaled 
static CMOS with aggressive pipelining. This approach is often combined with clock gating
to reduce the dissipation of idle flip-flops and branches of the clock tree. In these
systems,  due  to  the  large  number  of  state  elements  (flip-flops)  and  loading  of  the  clock 
tree,  clock  tree  and  flip-flop  power  consumption  can  often  be  a  substantial  fraction  of  total 
system  dissipation. 


Figure  3:  Schematic  of  charge-recovering  flip-flop. 

This project has led to the discovery of a novel low-power design methodology that
relies on charge recovery in conjunction with a novel flip-flop and a novel single-phase
resonant clock generator to reduce the power dissipation associated with clock distribution.
The charge-recovering flip-flop, shown in Figure 3, has several properties that make it
well suited for deeply pipelined low-voltage static CMOS systems. First, it exhibits
near-zero energy consumption while idle (i.e., when the data input switching activity is zero).
This property eliminates the need for clock-gating logic, yielding savings similar to those of
fine-grained clock gating at every flip-flop in the design. With constant input data, flip-flop
dissipation is a remarkable 4.9fJ/cycle at 500MHz/1.5V, or 1.8fJ/cycle at 200MHz/1.0V,
in a 0.25 µm process. Second, the energy consumption of the proposed flip-flop for unit
switching activity is 75fJ/cycle at 200MHz, making it competitive with other low-energy,
high-speed flip-flops [17, 30, 37]. Third, the proposed flip-flop is very compact, containing
only 14 transistors in its minimal configuration, or 18 transistors with an embedded
two-input XOR gate, for example. It is furthermore capable of operating with several different
charge-recovering power-clock waveforms.




Figure  4:  Resonant  power-clock  generator. 

The  substantial  reduction  in  power  dissipation  achieved  by  the  proposed  flip-flop  is 
enabled  by  the  charge-recovering  power-clock  generator  shown  in  Figure  4.  The  proposed 
resonant  generator  has  several  features  making  it  particularly  suitable  for  driving  systems 
containing  the  charge-recovering  flip-flop.  First,  its  topology  enables  the  large  resonant 
currents  to  bypass  the  main  power  switch,  allowing  it  to  be  much  smaller  than  topologies 
where  the  entire  current  is  conducted  by  the  switch.  Second,  the  gate  drive  of  the  main 
switch uses an efficient dynamic circuit that optimizes the turn-on and turn-off rates to
minimize dissipation. Third, a compact and fast control circuit enables the clock generator
to decide, on a cycle-by-cycle basis, whether or not to replenish the resonating power-clock
energy. This capability allows the clock generator to maintain a stable power-clock
amplitude while consuming as little energy as possible.

To evaluate the effectiveness of the proposed energy-recovering design methodology,
a conventional ASIC design was re-implemented using the energy-recovering flip-flop and
resonant clock generator. A direct power comparison was enabled by combining a conventional
static CMOS flip-flop with the charge-recovering flip-flop through a multiplexer.
A conventional buffered clock tree was used for the static CMOS flip-flops, while a wide
metal distribution network driven by the resonant power-clock generator was used for the
charge-recovering (pTERF) flip-flops. Our simulation results for this voltage-scaled system
indicate savings greater than four-fold for low switching activity (when the state elements
dominate dissipation) and savings of 20% for high switching activity (when the combinational
logic dominates dissipation).

2.3  Charge-recovering  SRAMs 

SRAMs  are  used  extensively  in  modern  processors  as  on-chip  memories  due  to  their  large 
storage  density  and  small  access  latency.  Low  power  on-chip  memories  have  become  the 
topic  of  substantial  research  as  they  can  account  for  almost  half  of  total  CPU  dissipation, 
even  for  extremely  power-efficient  designs  [13]. 

Charge recovery is a particularly attractive approach to the reduction of power dissipation
in high-density memories with large switching capacitance. Energy recovery schemes
reduce energy dissipation by limiting voltage differences across conducting devices and
by recovering charges from the load capacitors. This controlled mode of operation is typically
accomplished through the coordination of time-varying voltage waveforms, called
power-clocks. Previous energy recovery approaches for static memories achieved considerable
energy savings over conventional SRAMs [29, 33, 24, 2, 19, 25, 34]. These schemes
required multiple-phase power-clocks, however, and experienced a variety of drawbacks,
including relatively low operating frequencies, long latencies, non-trivial area overheads,
and access-pattern dependent energy savings.

This project has resulted in the design of a novel low-power charge-recovering SRAM.
With its fast operation and low overhead, this SRAM is suitable for on-chip caches,
providing single-cycle latency read and write operations, and avoiding the shortcomings of
previous energy-recovering approaches. It is powered by a single-phase sinusoidal power-clock
to minimize power-clock generator dissipation, coupling noise, and area overhead.

The  main  feature  of  the  proposed  SRAM  is  a  charge  recovering  driver  that  reclaims 
energy from the capacitors of the bit/word lines. A small control circuit embedded in the
driver  keeps  its  operation  in  synchrony  with  the  power-clock.  Through  the  use  of  feedback 
from  the  driver  output  to  the  control  circuit,  the  operation  of  our  driver  remains  efficient, 
independent  of  the  operation  sequence.  To  provide  single-cycle  read  with  a  single-phase 
power-clock,  a  precharge-low  scheme  is  employed  in  conjunction  with  a  current-mode 
sense amplifier that is modified to operate efficiently near Vss. With the exception of
the  energy  recovering  drivers  and  the  modified  sense  amplifiers,  the  structure  of  our  static 
memory  is  identical  to  that  of  conventional  SRAMs. 

In Hspice simulations of a 256x256 SRAM in a 0.35 µm TSMC process, our energy-recovering
memory achieves energy savings in excess of 2.6x in comparison with a conventional
counterpart at 3V, 300MHz. Our SRAM functions correctly over a wide range of
supply voltages and operating frequencies. Maximum operating frequency ranges from
1MHz at 0.7V to over 500MHz at 2.75V.

2.4  Low-power  DRAMs 

Conventional DRAMs use a single refresh waveform whose period is dictated by the minimum
data-retention time of the memory cells. To ensure that the state of all cells is correct
at any time, the refresh period t_REF must not exceed the shortest data-retention time t_RET.
For the majority of DRAM cells, however, data retention times are significantly longer
than t_REF. Refreshing these cells with a period t_REF is therefore unnecessary and results
in excessive power dissipation.

This  project  has  investigated  a  novel  scheme  for  reducing  data-retention  power  in  DRAMs 
by  using  multiple  refresh  periods  and  refreshing  on  a  block  basis.  Through  the  introduction 
of  multiple  refresh  periods,  cells  with  long  data-retention  times  can  be  refreshed  less  often 
than  leaky  cells  with  short  ones.  Moreover,  since  the  number  of  leaky  cells  is  typically 
small  [31],  by  refreshing  cells  in  relatively  small  groups,  as  opposed  to  entire  rows,  the 
refresh  period  of  each  block  is  increased.  To  further  extend  the  refresh  periods  of  refresh 
blocks,  a  swap  cell  is  introduced  to  provide  a  replacement  for  the  most  leaky  cell  in  each 
block. 

To compute an optimal set of refresh periods that minimizes refresh power dissipation,
a novel polynomial-time algorithm has been designed. Specifically, given an integer K and
the data retention times of the refresh blocks in the memory, this algorithm determines in
O(KN^2) steps a set of K refresh periods that minimize the power dissipation due to
refreshing. In simulations of a 16Mb DRAM, the proposed block-based multi-period refresh
reduces power dissipation by a factor of 4 with an area overhead of at most
6%.
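
The report does not give the algorithm itself. As an illustration only, the following dynamic program selects K refresh periods within the same O(KN^2) bound under an assumed cost model: N is taken to be the number of refresh blocks, each block must be refreshed with a period no longer than its own retention time, and refreshing a block with period p costs 1/p. The function name and the cost model are assumptions, not the authors' formulation.

```python
from typing import List, Tuple


def optimal_refresh_periods(retention: List[float], k: int) -> Tuple[float, List[float]]:
    """Pick up to k refresh periods minimizing the total refresh rate.

    Assumed model: blocks sorted by retention time are split into contiguous
    groups, and each group is refreshed at its shortest retention time.
    """
    t = sorted(retention)
    n = len(t)
    if n == 0 or k < 1:
        raise ValueError("need at least one block and one period")
    k = min(k, n)  # more periods than blocks cannot help
    inf = float("inf")
    # best[j][i]: minimum cost of covering the first i blocks with j periods.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):  # last group holds blocks s .. i-1
                if best[j - 1][s] == inf:
                    continue
                cost = best[j - 1][s] + (i - s) / t[s]
                if cost < best[j][i]:
                    best[j][i] = cost
                    cut[j][i] = s
    # Walk the recorded cuts backwards to recover the chosen periods.
    periods: List[float] = []
    i = n
    for j in range(k, 0, -1):
        s = cut[j][i]
        periods.append(t[s])
        i = s
    periods.reverse()
    return best[k][n], periods


cost, periods = optimal_refresh_periods([1, 1, 4, 4, 4, 8], 2)
print(cost, periods)
```

In this toy instance, two leaky blocks force a short period for themselves, while the remaining four blocks share a period four times longer. The three nested loops give the O(KN^2) running time.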

Numerous  approaches  have  been  investigated  for  reducing  data  retention  power  in 
DRAMs. Schemes that reduce leakage current by optimizing process conditions or by controlling
the potential of various nodes in the cell have been reported in [9, 8, 7]. Schemes
for  dynamically  controlling  the  refresh  period  through  a  limited  number  of  temperature 
or  current  sensors  have  been  reported  in  [11,  26].  The  use  of  memory  access  history  for 
reducing  the  number  of  refreshes  to  previously  accessed  rows  has  been  proposed  in  [27]. 
The  use  of  Error  Correcting  Codes  (ECCs)  to  control  the  number  of  errors  below  a  required 
level  while  setting  the  refresh  period  to  a  higher  value  was  reported  for  mass  storage  media 
in  [12].  The  introduction  of  a  second  refresh  period  for  rows  with  long  data  retention  times 
was  proposed  in  [10,  32].  These  papers  focus  on  implementation  issues,  however,  and  do 
not  present  systematic  techniques  for  the  optimal  selection  of  the  second  period. 

2.5  Reconfigurable  pipelining 

Pipelining enables the realization of high-speed, high-efficiency CMOS datapaths by allowing
supply voltages to be reduced to the lowest possible levels while still satisfying
throughput constraints. In deep pipelines, however, registers and their corresponding clock
trees are responsible for an increasingly large fraction of total dissipation, no matter how
efficiently they may have been implemented [21, 23, 30, 35]. For example, the power
consumed by the registers of a PVQ decoder described in [23] amounts to 90% of total
datapath dissipation. In general, these registers latch their inputs unconditionally, even if
the input data do not change, and thus consume significant power regardless of their
implementation [21, 35, 30].

This  project  has  explored  a  novel  methodology  for  designing  fine-grain  reconfigurable 
pipelined  datapaths  that  can  adapt  their  performance  and  dissipation  to  required  data  rates 
in  real  time.  These  datapaths  can  efficiently  cope  with  the  variability  of  data  rate  that 
is  commonplace  in  numerous  applications.  The  proposed  reconfiguration  methodology 
reduces  energy  dissipation  by  disabling  and  bypassing  a  select  subset  of  registers.  The 
number  of  register  stages  and  corresponding  clock  trees  to  be  disabled  at  any  interval  in 
the  operation  of  the  pipeline  is  periodically  determined  by  the  amount  of  computation  that 
needs  to  be  performed  at  the  time.  Reconfiguration  can  be  performed  “on  the  fly”,  while 
data  is  streaming  through  the  datapath.  The  control  hardware  overhead  associated  with  our 
approach is very low. For an n-stage pipeline, additional hardware is limited to O(n lg n)
state bits and O(n) multiplexers.
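
The idea of disabling and bypassing registers can be sketched with a small cycle-level model. This is a hypothetical illustration, not the control scheme of the actual design: `keep_mask` and `run_pipeline` are invented names, the rule of keeping every m-th register is an assumption, and the model presumes that the clock period is long enough for chained stages to settle.

```python
from typing import Callable, List, Optional, Sequence

Stage = Callable[[int], int]


def keep_mask(n_stages: int, merge: int) -> List[bool]:
    """Enable flags for the n_stages - 1 internal registers.

    Assumption: when `merge` consecutive stages are chained, only every
    merge-th register stays enabled and the others are bypassed.
    """
    if n_stages < 1 or merge < 1:
        raise ValueError("n_stages and merge must be positive")
    return [(i + 1) % merge == 0 for i in range(n_stages - 1)]


def run_pipeline(stages: Sequence[Stage], enabled: Sequence[bool],
                 inputs: Sequence[int]) -> List[int]:
    """Cycle-level model with one new input per cycle.

    An enabled register delays its value by one cycle. A disabled register
    is transparent, as if its bypass multiplexer forwarded the value.
    """
    if len(enabled) != len(stages) - 1:
        raise ValueError("need one flag per internal register")
    regs: List[Optional[int]] = [None] * len(enabled)
    outputs: List[int] = []
    # Extra empty cycles flush the values still held in enabled registers.
    feed: List[Optional[int]] = list(inputs) + [None] * sum(enabled)
    for x in feed:
        value = x
        next_regs = list(regs)
        for i, stage in enumerate(stages):
            value = None if value is None else stage(value)
            if i < len(enabled) and enabled[i]:
                next_regs[i] = value  # captured at the clock edge
                value = regs[i]       # downstream logic sees the old content
        if value is not None:
            outputs.append(value)
        regs = next_regs
    return outputs


stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
for merge in (1, 2, 3):
    mask = keep_mask(len(stages), merge)
    print(merge, sum(mask), run_pipeline(stages, mask, [1, 2, 3]))
```

Every mask yields the same output stream, while the number of registers that are still clocked (a rough proxy for register and clock-tree energy) drops as more stages are merged. One bypass multiplexer per register is in line with the O(n) multiplexer count quoted above.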

An application domain that naturally lends itself to the proposed real-time fine-grain
reconfiguration scheme is video processing, a key component of multimedia communications
and a potentially integral part of next-generation portable devices. Currently, there
are several video standards established for different purposes, including MPEG-1, MPEG-2,
and H.261, and their implementations for mobile systems-on-a-chip (SoC) should provide
substantial computing capabilities at low energy consumption levels [15]. The building
blocks of these standards include demanding computations such as the discrete cosine
transform (DCT), inverse discrete cosine transform (IDCT), motion estimation, motion
compensation, variable-length coding/decoding, quantization, and inverse quantization.
Video streams are particularly suitable to low-power processing using our reconfiguration
approach, because the required data rates of downstream components can be inferred by
observing the output values of upstream components. Due to the real-time reconfiguration
capability of our scheme, variable data rates can be accommodated without interrupting the
flow of data through the pipeline. In contrast, alternative dynamic adaptation schemes such
as voltage scaling would require several cycles to reconfigure the system and thus result in
unacceptable latencies for real-time latency-sensitive applications.

To evaluate its efficiency, the proposed methodology was applied to the design of
reconfigurable pipelined multipliers that were used in IDCT modules with varying degrees
of parallelism. The multipliers were dynamically reconfigured according to the number
of nonzero DCT coefficients per block and the picture size. The energy efficiency of the
reconfigurable IDCTs was compared with that of statically pipelined IDCTs with identical
architecture and peak performance capability. In simulations with a 0.35 µm CMOS
technology, the reconfigurable pipelined multipliers for the 2-dimensional IDCT achieved
relative energy reductions of up to 65% over their non-reconfigurable counterparts.


2.6  Low-power  Viterbi  decoder  design 


Viterbi decoders (VDs) are widely used in digital wireless communication systems due to
their powerful error correction capabilities. The quality of a VD design is mainly measured
by three criteria: coding gain, throughput, and power dissipation. High coding gain
results in low data transfer error probability. High throughput is necessary for high-speed
applications such as 802.11a wireless LAN. The design of VDs with high coding gain and
throughput is challenged by the need for low power, however, since VDs are often placed
in communication systems running on batteries.


Figure 5: Performance of various VDs.

Single-chip VD design has been a very active research area for the past 15 years.
Figure 5 shows the throughput, power dissipation, and number of states for a large collection
of VD designs [3, 5, 14, 36]. In general, these VDs fall within the region between the
two dashed lines. While small-state VDs can have throughput over 1 Gbps, throughput
decreases with the increase in the number of states. Consequently, the design of
high-throughput large-state VDs, though crucial for applications such as satellite
communications [4], has remained largely unexplored.

The  main  challenge  in  implementing  high-throughput  large-state  VDs  with  low  power 
dissipation  is  the  rapid  growth  in  the  computational  complexity  of  a  VD  with  the  increase 
in  its  state  count.  Although  parallel  arithmetic  processors  can  be  introduced  to  speed  up 
the  computation,  they  often  generate  extremely  complex  interconnect  routing  problems, 
degrading  both  system  throughput  and  power  dissipation.  Therefore,  achieving  high  speeds 
using  a  large-state  VD  with  low  power  dissipation  requires  careful  consideration  of  global 
data  transfer  and  interconnect  issues. 


Figure  6:  Layout  of  256-state  Viterbi  decoder. 

This  project  has  investigated  a  low-power  design  methodology  for  building  high-throughput 
large-state  VDs.  Furthermore,  it  has  applied  this  methodology  to  the  design  of  a  256-state 
rate-1/3 IS-95 VD chip, whose layout is shown in Figure 6. This chip achieves high power
efficiency  due  to  our  comprehensive  design  approach  that  focuses  on  the  reduction  of  global 
data  transfers,  the  minimization  of  global  buses,  and  the  maximization  of  datapath  pipeline 
depths  with  no  data  forwarding  requirements.  The  proposed  design  methodology  results 
in  a  low-power  VD  architecture  consisting  of  8  parallel  processors  connected  by  4  global 
buses. The VD was synthesized using a 0.25 µm standard cell library. The processors were
placed  in  an  array  structure  manually  and  then  routed  automatically.  In  simulations  from 
layout,  the  decoder  achieves  a  throughput  of  20  Mbps  while  dissipating  only  450  mW.  To 
our  knowledge,  this  dissipation  is  the  lowest  among  published  VDs  with  the  same  number 
of  states  and  throughput. 

2.7  System-level  power  estimation 

To support the exploration of the numerous architectural alternatives that arise when
designing with intellectual property (IP) components, fast and accurate design automation
tools are required to evaluate key design characteristics such as timing, area, and power
dissipation. Although it is possible for IP vendors to capture timing and area using a single
number, the characterization of power in a simple manner has remained elusive, since
power dissipation is input data dependent [20]. Accurate circuit-level power estimation
tools require detailed design simulations and therefore have unacceptably high runtimes
for real-time design space exploration. On the other hand, fast high-level power tools
suffer from relatively large estimation errors. Thus, efficient and accurate power estimation
for IP-based systems remains a challenging task.

This  project  has  explored  a  novel  hybrid  power  estimation  methodology  for  programmable 
systems.  To  obtain  fast  and  accurate  estimates,  the  proposed  method  combines  high-level 
simulation  with  analytical  macromodeling  that  abstracts  circuit-level  characteristics  such  as 
switching  and  leakage  power.  Given  a  program  and/or  a  primary  input  sequence,  functional 
simulation of the system is performed to derive all data signals at the datapath/memory
interface as well as all control signals. This simulation does not explicitly use the detailed
structure of the datapath. Based on the control signals derived, different possible datapath
topologies are identified. For each such topology, or mode, a fixed-point iteration is
applied  in  conjunction  with  analytical  output  macromodeling  to  calculate  signal  statistics 
among  the  various  circuit  components.  These  statistics  are  then  applied  to  analytical  power 
macromodels  to  obtain  the  dissipation  of  the  entire  datapath  and  global  interconnects.  For 
control  and  memory,  power  estimates  are  obtained  using  the  signals  from  the  high-level 
simulation. 

The proposed methodology relies on a firm theoretical basis and yields fast and accurate
power estimates in practice. Sufficient conditions on the macromodel functions have
been  discovered  under  which  the  iterative  procedure  for  computing  signal  statistics  in  the 
datapath  is  guaranteed  to  converge  to  a  unique  point.  Fast  runtimes  and  high  estimation 
accuracies  are  accomplished  through  the  judicious  combination  of  high-level  simulation 
and  macromodeling  of  circuit-level  characteristics.  By  focusing  on  the  interfaces  among 
control, datapath, and memory, simulation does not consider details of the datapath structure
and thus is very fast. At the same time, it accurately captures the control signals that
affect  the  dataflow  and,  therefore,  utilization  and  power  dissipation  of  hardware. 
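
The fixed-point computation of signal statistics can be sketched as follows. This is not HyPE's implementation: the component names, the linear macromodels, and their coefficients are made up for illustration, and a single scalar per signal stands in for whatever statistics the real macromodels track. The example macromodels are contractive, which is one simple condition under which such an iteration converges to a unique point.

```python
from typing import Callable, Dict, List, Tuple

Macromodel = Callable[[List[float]], float]


def solve_signal_statistics(
    models: Dict[str, Macromodel],
    fanin: Dict[str, List[str]],
    primary: Dict[str, float],
    tol: float = 1e-12,
    max_iter: int = 1000,
) -> Tuple[Dict[str, float], int]:
    """Iterate output macromodels until the signal statistics stop changing."""
    stats = dict(primary)
    for name in models:
        stats[name] = 0.0  # arbitrary starting guess
    for iteration in range(1, max_iter + 1):
        change = 0.0
        for name, model in models.items():
            new = model([stats[src] for src in fanin[name]])
            change = max(change, abs(new - stats[name]))
            stats[name] = new
        if change < tol:
            return stats, iteration
    raise RuntimeError("no convergence; the macromodels may not be contractive")


# Hypothetical linear macromodels for two components in a feedback loop.
models = {
    "adder": lambda s: 0.5 * s[0] + 0.25 * s[1],
    "register": lambda s: 0.5 * s[0],
}
fanin = {"adder": ["in", "register"], "register": ["adder"]}
stats, iterations = solve_signal_statistics(models, fanin, {"in": 0.7})
print(stats, iterations)
```

For these made-up models the iteration settles at an adder statistic of 0.4 and a register statistic of 0.2. In a full flow, such values would then be fed to power macromodels, a step that is omitted here.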

The proposed hybrid scheme has been implemented in a power estimation tool called
HyPE, which has been used to explore several architectural alternatives in two system
designs, such as the power impact of computation parallelism and loop unrolling on the design
of 256-state Viterbi decoders and Rijndael encryptors. The experimental results obtained
demonstrate the high effectiveness of the proposed approach. For systems with 100k logic
gates, HyPE terminates within seconds. Compared with state-of-the-art industrial gate-level
power estimation tools, our methodology is 2 to 3 orders of magnitude faster with
5.4% power estimation deviation on average.

• Ph.D., May 2001. Kim joined IBM Research as a Member of the Technical Staff.

Xun  Liu,  Graduate  Research  Assistant 

• Ph.D., May 2003. Liu joined North Carolina State University as an Assistant Professor
of Electrical and Computer Engineering.

Conrad  Ziesler,  Graduate  Research  Assistant 

•  S.M.,  May  2002. 

Joohee  Kim,  Graduate  Research  Assistant 

Juang-Ying  Chueh,  Graduate  Research  Assistant 

5 Report of inventions

The University of Michigan filed three patent applications with the US Patent and Trademark
Office in connection with the following three inventions.

•  Low-power  SRAM  with  energy  recovery.  File  No.  2300,  University  of  Michigan 
Technology  Transfer  Office,  March  2002. 

• Single phase resonant clock generator for energy recovering systems. Invention
Disclosure No. 2299, University of Michigan Technology Transfer Office, March 2002.

•  Low-power  flip-flop  with  energy  recovery  and  automatic  clock  gating.  Invention 
Disclosure  No.  2270,  University  of  Michigan  Technology  Transfer  Office,  March 
2002. 